An in-depth analysis to measure the effectiveness of promotions and how they influence the customer’s buying patterns. The analysis also encompasses the demographic analysis, sales analysis and the product sales effectiveness to the promotions.

Content

1. Introduction

The research is based on the the effectiveness of a promotion - Type A. The goal of the study is to derive insights from the available ‘Customer Full Journey’ datasets relating to customers, transactions, and how these factors influence the sales when the promotion was active. As part of our methodology we plan to clean, tidy, explore, and analyze the datasets we received to finally answer the following questions of interest -

Question #1: Was the customer responsive to the promotion/campaign ran during a specific period?

Question #2: Was holiday a driver of sales or was it the promotion that helped in acquiring new customer, retaining existing customer and overall increasing customer enagement?

Question #3: To what degree was the promotion effective?

Question #4: How were the customer purchase pattern different when the promotion was active VS. when the promotion was inactive?

Question #5: What was the Expenditure Per Hour (EPH) when the promotion was active VS. when the promotion was inactive?

Question #6: What are the common demographics of the customers who displayed disparate purchasing pattern when the promotion was active VS. when the promotion was inactive?

Question #7: What are the top product categories that experienced uptick/decline in sales and sales volume?

Our methodology can be broadly classified into three steps -

Data cleaning: Clean and tidy the tables/datasets in R, and make data ready for analysis

Data Exploration: Explore the nature and source of data, explore missing values and reasons ; identify abnormal values and outliers, and visualize variable distributions.

Data Analysis: Applying relevant joins, filters, aggregation methods and calculations to derive KPIs and effective measure to derive meaningful insights. We propose a comparative analysis of sales and sales volume, followed by descriptive insights on the demographics and products.

This research will help the client in two major ways. Firstly, understand better as to what factors impact the success of a promotion and better understand the demographics and product category influenced by the promotion. The insights can be further used to hyperpersonalize marketing initiatives to target specific customer segment in the future. Secondly, design better promotions against different product types and demographic features of the customers powered by the knowledge of customer engagement (and spending) levels with various product categories.

2. Packages

Packages to be installed

Following Packages are needed for the project-

completejourney - Retail shopping transactions for 2,469 households over one year;
tidyverse - Included for data tidying ; Functions-To use gather and spread function
ggplot2 - To create Data Visualizations
plotly - To create Data Visualizations
patchwork - To combine separate ggplots into the same graphic
lubridate - To analyse Date-Time data
viridis - To use different color palettes
knitr - To do formatting in R markdown
kableExtra - To do formatting in R markdown
reshape2 - To reshape data
ggalluvial - To produce alluvial plots in a tidyverse framework

suppressWarnings(suppressMessages(library(completejourney)))
suppressWarnings(suppressMessages(library(tidyverse)))
suppressWarnings(suppressMessages(library(plotly)))
suppressWarnings(suppressMessages(library(ggplot2)))
suppressWarnings(suppressMessages(library(patchwork)))
suppressWarnings(suppressMessages(library(lubridate)))
suppressWarnings(suppressMessages(library(viridis)))
suppressWarnings(suppressMessages(library(reshape2)))
suppressWarnings(suppressMessages(library(ggalluvial)))

3. Data Sources & Pre-Processing

3.1. Data Source : The original data on the retail client was downloaded from the Data Wrangling course folders

3.2.1 Introduction of source data: The dataset primarily tracks the transaction level data from 2,500 households who frequently shop at the client’s stores. Moreover, the data captures purchases made across the entire gamut of product categories available at the retailer’s stores. Among other important information, we have customer demographics data and direct marketing campaigns data available for certain households. (Please Note: The terms ‘Customer’ and ‘Household’ have been assumed synonymous and used interchangeably in this report)

Timeline: The data tracks customer transactions over a period of one year (52 weeks).

3.2.2 Explanation of source data: (tables and variables) In total, this comprehensive total customer journey had 8 different data frames. Each of these are described in brief below. Also a small glimpse into each of these data is given below

a) campaigns: campaigns received by each household

campaign_id household_id
1 105
1 1238
1 1258
1 1483
1 2200

b) campaign_descriptions: campaign metadata (length of time active)

campaign_id campaign_type start_date end_date
1 Type B 2017-03-03 2017-04-09
2 Type B 2017-03-08 2017-04-09
3 Type C 2017-03-13 2017-05-08
4 Type B 2017-03-29 2017-04-30
5 Type B 2017-04-03 2017-05-07

c) coupons: coupon metadata (UPC code, campaign, etc.)

coupon_upc product_id campaign_id
10000085207 9676830 26
10000085207 9676943 26
10000085207 9676944 26
10000085207 9676947 26
10000085207 9677008 26

d) coupon_redemptions: coupon redemptions (household, day, UPC code, campaign)

coupon_upc product_id campaign_id
10000085207 9676830 26
10000085207 9676943 26
10000085207 9676944 26
10000085207 9676947 26
10000085207 9677008 26

e) demographics: household demographic data (age, income, family size, etc.)

household_id age income home_ownership marital_status household_size household_comp kids_count
1 65+ 35-49K Homeowner Married 2 2 Adults No Kids 0
1001 45-54 50-74K Homeowner Unmarried 1 1 Adult No Kids 0
1003 35-44 25-34K NA Unmarried 1 1 Adult No Kids 0
1004 25-34 15-24K NA Unmarried 1 1 Adult No Kids 0
101 45-54 Under 15K Homeowner Married 4 2 Adults Kids 2

f) products: product metadata (brand, description, etc.)

product_id manufacturer_id department brand product_category product_type package_size
25671 2 GROCERY National FRZN ICE ICE - CRUSHED/CUBED 22 LB
26081 2 MISCELLANEOUS National NA NA NA
26093 69 PASTRY Private BREAD BREAD:ITALIAN/FRENCH NA
26190 69 GROCERY Private FRUIT - SHELF STABLE APPLE SAUCE 50 OZ
26355 69 GROCERY Private COOKIES/CONES SPECIALTY COOKIES 14 OZ

g) promotions_sample: a sampling of the product placement in mailers and in stores corresponding to advertising campaigns

product_id store_id display_location mailer_location week
1000050 337 3 0 1
1000092 317 0 A 1
1000214 317 6 0 1
1000235 317 0 A 1
1000235 337 0 A 1

h) transactions_sample: a sampling of the products purchased by households

household_id store_id basket_id product_id quantity sales_value retail_disc coupon_disc coupon_match_disc week transaction_timestamp
2261 309 31625220889 940996 1 3.86 0.43 0 0 5 2017-01-28 14:06:53
2131 368 32053127496 873902 1 1.59 0.90 0 0 10 2017-02-28 22:31:57
511 316 32445856036 847901 1 1.00 0.69 0 0 13 2017-03-26 13:22:21
400 388 31932241118 13094913 2 11.87 2.90 0 0 8 2017-02-18 13:13:10
918 340 32074655895 1085604 1 1.29 0.00 0 0 10 2017-03-02 15:05:57

4. Data Preparation

4.1 Tables of concern for Problem #1:

- transactions data
- campaign_descriptions data
- coupons_redemption data
- products data

4.2 Data Preparation for insight generation:

a. Joining datasets

  • Join the transactions, coupons and campaign_desc tables
  • filtering out null values after the join as they represent mismatched timestamps
  • transactions displayed where Campaign A was effective VS. transactions without promotion period
transactions <- get_transactions()
promotions <- get_promotions()

transactions <- transactions %>%
  mutate(transaction_date = as.Date(transactions$transaction_timestamp))

coupon_redemptions <- coupon_redemptions %>%
  left_join(campaign_descriptions, by = "campaign_id") %>%
  full_join(coupons, by = c("coupon_upc", "campaign_id"))

transactions_involving_coupons <- transactions %>%
  left_join(coupon_redemptions, by = c("household_id", "product_id", "transaction_date" = "redemption_date")) %>%
  filter(!is.na(campaign_id))

transactions_involving_coupons%>% 
  glimpse() 
## Rows: 3,538
## Columns: 17
## $ household_id          <chr> "1029", "165", "712", "2488", "1923", "1923", "1…
## $ store_id              <chr> "438", "321", "448", "324", "369", "369", "369",…
## $ basket_id             <chr> "31198980681", "31242775833", "31316965685", "31…
## $ product_id            <chr> "9194206", "1090676", "9677486", "12171765", "84…
## $ quantity              <dbl> 2, 2, 2, 1, 1, 1, 6, 1, 1, 2, 1, 1, 1, 1, 1, 1, …
## $ sales_value           <dbl> 4.28, 3.00, 5.54, 2.00, 0.25, 0.50, 2.60, 2.00, …
## $ retail_disc           <dbl> 2.10, 1.58, 1.84, 0.49, 0.19, 0.19, 1.14, 1.99, …
## $ coupon_disc           <dbl> 0.00, 1.00, 0.00, 0.50, 0.25, 0.00, 1.00, 1.00, …
## $ coupon_match_disc     <dbl> 0.00, 0.00, 0.00, 0.50, 0.25, 0.00, 1.00, 0.00, …
## $ week                  <int> 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 7, 7, …
## $ transaction_timestamp <dttm> 2017-01-01 11:58:48, 2017-01-03 12:03:19, 2017-…
## $ transaction_date      <date> 2017-01-01, 2017-01-03, 2017-01-07, 2017-01-10,…
## $ coupon_upc            <chr> "51380041313", "53377610033", "54300016033", "51…
## $ campaign_id           <chr> "26", "26", "26", "26", "26", "26", "26", "26", …
## $ campaign_type         <ord> Type B, Type B, Type B, Type B, Type B, Type B, …
## $ start_date            <date> 2016-12-28, 2016-12-28, 2016-12-28, 2016-12-28,…
## $ end_date              <date> 2017-02-19, 2017-02-19, 2017-02-19, 2017-02-19,…

b. Further data preparation for comparative analysis, analyzing customer purchase pattern, and product sales

campaign_df %>% 
  filter(Campaign == "Not Campaign A") %>%
  arrange(desc(Total_Sales)) %>% 
  head(10) %>% 
  kbl() %>%
  kable_styling()
basket_id Campaign Sales_Volume Total_Sales
33015981495 Not Campaign A 3 36.97
41479822453 Not Campaign A 14 34.86
40300451090 Not Campaign A 9 33.00
40853292449 Not Campaign A 4 31.96
41338486706 Not Campaign A 17 29.68
40800778950 Not Campaign A 3 27.05
41109557898 Not Campaign A 1 24.99
41259462119 Not Campaign A 1 24.99
40955620515 Not Campaign A 3 23.56
41479740437 Not Campaign A 14 22.68
campaign_df %>% 
  filter(Campaign == "Not Campaign A") %>%
  summary() %>% 
  kbl() %>%
  kable_styling()
basket_id Campaign Sales_Volume Total_Sales
Length:155 Length:155 Min. : 1.000 Min. : 0.600
Class :character Class :character 1st Qu.: 1.000 1st Qu.: 2.830
Mode :character Mode :character Median : 2.000 Median : 5.000
NA NA Mean : 2.813 Mean : 7.176
NA NA 3rd Qu.: 3.000 3rd Qu.: 9.040
NA NA Max. :17.000 Max. :36.970

Note

- Total Rows & Columns: 155,4 (respectively)
- Total Sales: Depict the total sales with corresponding to each basket ID
- Sales Volume: Depict the quantity purchased corresponding to each basket ID

c. Calculation of Expenditure per Hour:

Expenditure Per Hour (EPH) is a measure coefficient to measure how effective was the sales for a given period of time. The duration can be defined as the time when the promotion campaign was active or the time when there was no promotion active. We have considered EPH on hourly basis to calculate dollar value per hour. EPH is derived by the following formula:

\(\sum (No. of Transactions)/\sum (Total duration)\)

After joining different tables and grouping the aggregation algorithm, we have considered unique households that shopped when Campaign A was active and the same set that shopped where there was no promotion in place. We agreed that EPH would be the best indicator of sales in both the periods (Campaign A active Vs. Not Active) only when we consider unique households that shopped during both the duration.

transactions_eph <- campaign_df_clean %>%
  mutate(Campaign = case_when(campaign_type == "Type A" ~ "Campaign A",
                              TRUE ~ "Not Campaign A"),
         Month = month(transaction_timestamp),
         Holiday = case_when(Month %in% c(10,11,12,1) ~ "Holiday Season",
                             TRUE ~ "Not Holiday Season")) %>%
  group_by(household_id,Campaign,Holiday,transaction_date) %>%
  summarise(Total_Sales = sum(sales_value, na.rm = T)) %>%
  group_by(household_id,Campaign,Holiday) %>%
  summarise(eph = mean(Total_Sales, na.rm = T)/24) %>%
  select(household_id,Campaign,Holiday,eph)
  transactions_eph %>% 
  head(10) %>% 
    kbl() %>%
  kable_styling()
household_id Campaign Holiday eph
1 Campaign A Holiday Season 0.9400000
1 Campaign A Not Holiday Season 0.5216667
1000 Campaign A Not Holiday Season 0.4616667
1005 Campaign A Not Holiday Season 0.3818750
101 Campaign A Holiday Season 0.2725000
101 Campaign A Not Holiday Season 1.7054167
1011 Campaign A Not Holiday Season 0.9545833
1012 Campaign A Not Holiday Season 0.0891667
1015 Campaign A Holiday Season 0.6512500
1015 Campaign A Not Holiday Season 0.0854167

c. Calculation of Average Basket Value

Average basket Value: Average basket value, also known as units per transaction, refers to the average number of items sold per single transaction. This is calculated by dividing the total value of all transactions by the number of transactions or sales.

\(\sum (Total Value of Transactions)/\sum (NumberofTransactions)\)

d. Campaign Durations

Campaign Durations are listed below:

Campaign A ran for total 196 days. With each duration listed below:

campaign_descriptions %>%
  filter(campaign_type == "Type A") %>%
  mutate(date_diff = as.numeric(difftime(end_date, start_date, units = "days"))) %>% 
     kbl() %>%
  kable_styling()
campaign_id campaign_type start_date end_date date_diff
8 Type A 2017-05-08 2017-06-25 48
13 Type A 2017-08-08 2017-09-24 47
18 Type A 2017-10-30 2017-12-24 55
27 Type A 2017-02-08 2017-03-26 46

Campaign A was inactive for all the other days when Campaign A was active.

Campaign A Inactive Duration = 365 - 196 = 169

5. Promotion Analysis

5.1 Analysis

a. Outlier Detection on transactions when Campaign A was Active VS. No Promotion Period

  • The first plot depicts the total sales distribution by when the Campaign A was running vs. when Campaign A was inactive.

  • We have also observed some outlier transactions which were exorbitantly large in both the periods. We perceive these transactions which were not driven by any promotion/campaign and could be driven by other factors.

  • The reader might remember that this dataset in the one we prepared (using wrangling procedures) in our data preparation section.

b. Analyzing total sales in campaign A vs. when campaign A was inactive by holiday season

  • Further we have also analyzed the sales during the holiday season, as holiday is one of the biggest drivers of sale. To rule out the hypothesis that higher sales (when campaign A was running) could have been due to the holiday season, we did a comparative analysis of sales in Campaign A during the holiday season and during normal period.

Distribution of total sales by basket id with the median value shown as a white dot. can be seen that the distribution was even throughout the year for campaign A. The distribution when campaign A was not running did not have as strong of a tail.

c. Analyzing sales volume in campaign A vs. when campaign A was inactive by holiday season

  • Also total sales might not be the correct representation of sales as we are in the promotion period where many products are sold on a discounted price, BOGO or multi-save and conditional promos.

  • Hence, we did a comparative analysis on Sales Volume (dollar per unit) in Campaign A during the holiday season and during normal period.

Distribution of sales volume by basket id with the median value shown as a white dot. It can be seen that the distribution was even throughout the year for campaign A. The distribution for campaign A was much more spread out than the other campaigns.

e. Analyzing total sales and sales volume when Campaign A was running VS. when Campaign A was inactive

The data was grouped by basket in order to examine the distribution. The data was grouped by campaign in order to examine the difference between campaign A and transactions when campaing A was inactive. Most of the sales volume fall between 0 & 20 with the total sales between $0 & $40.

- The above plot clearly shows that total sales and sales volume was relatively higher while in campaign A as compared to when the promotion was inactive. - The data was grouped by basket ID which is an unique indicator of sales quantity under a transaction

5.2 Measure of Promotion Effectiveness

a. Expenditure by Hour VS. Time of Year

Expenditure Per Hour (EPH) is a measure coefficient to measure how effective was the sales for a given period of time. The duration can be defined as the time when the promotion campaign was active or the time when there was no promotion active. We have considered EPH on hourly basis to calculate dollar value per hour. EPH is derived by the following formula:

\(\sum (No. of Transactions)/\sum (Total duration)\)

The calculation of EPH for the dataset is provided in the previous section - “Data Preparation”

  • Generally, a mean EPH of 0.3 and higher for a promotion is considered to be a “successful promotion”.
  • We experienced a mean of 0.53 for Campaign A and 0.13 of mean EPH when campaign A was not running.
## [1] 0.5344225
  • Furthermore, we also calculated EPH for transactions in a period where campaign A was not running and we considered both the holiday and the non-holiday season.

- From the above plot it is a clear indication that the major driver of sales were the promotion programs that ran during campaign A. Despite of heavy sales volume experienced in the holiday season, campaign A had a strong mean EPH and higher total sales & sales volume even during the duration when there was no promotion running.

5.2 Demographic Deep-Dive Analysis when Campaign A was active vs. inactive

  • Taking a step further, we are analyzing the customer segments that shopped during the campaign period and also when the campaign was inactive.
  • We agreed that the customer segmentation would be best observed in both the periods (Campaign A active Vs. Not Active) only when we consider unique households that shopped during both the duration.
  • We have considered top 3 income range that had most sales against their household IDs.

The demographics are shown as how the relate to the campaign. The blue lines represent campaign A and the green lines represent sales when campaign was inactive.

5.3 Product Sales Deep-Dive Analysis when Campaign A was active vs. inactive

- Now, once we came to the conclusion that Campaign A was effective in all the sales season, we have further dived a level deeper to analyze the top products that contributed to the sales when Campaign A was active.

- We observed that Milk, Frozen Products and Meat were the top contributors towards sales in both the durations.

Further, to strengthen our assertion about effectiveness of Campaign A on specific products, we have plotted the Average sales volume per basket for Milk, Frozen Products & Meat. These products have been specifically filtered out by COMMON household_ID as they projected highest sales during both the durations.

5.3.1 Average Meat Item Sales Per Basket by Campaign Type & Time of Year

5.3.2 Average Frozen Items Sales Per Basket by Campaign Type & Time of Year

5.3.3 Average Milk Items Sales Per Basket by Campaign Type & Time of Yea

6. Summary & Recommendations

6.1: Interesting Insights:

  1. Consumer spending in Campaign A seems to be highest if consumers are married, have a household size of 2 to 3, earn somewhere between $50-99k, and who purchase heavily in both holiday and non-holiday season.

  2. The important product segments is Meat, Milk & Frozen Items with highest sales. These products have exhibited tremendous sales in both the duration of when the campaign A was active vs. inactive and also during the holiday season vs. non-holiday season.

  3. Mean EPH of 0.3 and higher for a promotion is considered to be a “successful promotion”. We experienced a mean of 0.53 for Campaign A and 0.13 of mean EPH when campaign A was not running.

6.2: Recommendations

Few recommended actions to boost short-term sales during promotion periods are listed below:

1. To maximize customer response:
Collect data from existing members for a certain time period, determine customer-to-promotion category match for individual customers, and send single offers accordingly.

2. To predict new customer behavior
Additional data (e.g., occupation, education level, location) might be requested in the profile section, when a new customer signs up to be a Regork to test for possible features that might contribute to a customer-to-promotion match.

3. To maximize success rate of Campaign A:
- Send single offers via social networks and mobile app.
- Offer discount and BOGO promotions with low purchase requirement.